Catalan-English Statistical Machine Translation without Parallel Corpus: Bridging through Spanish

Authors

  • Adrià de Gispert
  • José B. Mariño
Abstract

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. To this end, we use an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, each of more than 30 M words. Given the proximity between Spanish and Catalan, we investigate the cost of using Spanish as a bridge towards large-vocabulary Catalan-English translation in a fully automatic statistical machine translation framework. Experimental results are promising: the achieved translation quality is nearly equivalent to that of the Spanish-English language pair, which practically brings SMT research for Catalan to the level of more prominent languages in terms of data availability.
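The bridging idea can be illustrated with a minimal sketch, assuming the simplest possible realisation: a cascade in which Catalan is first translated into Spanish and the Spanish output is then translated into English. The abstract does not say how the two systems are combined, so the cascade, the word-by-word toy lexicons, and the names `make_toy_translator` and `bridge` are all hypothetical illustrations, not the authors' method.

```python
from typing import Callable, Dict

# A translator is modelled as any function from a sentence to a sentence.
Translator = Callable[[str], str]


def make_toy_translator(lexicon: Dict[str, str]) -> Translator:
    """Build a word-by-word toy translator (hypothetical stand-in for an SMT system)."""
    def translate(sentence: str) -> str:
        # Unknown words are passed through unchanged.
        return " ".join(lexicon.get(token, token) for token in sentence.split())
    return translate


def bridge(first: Translator, second: Translator) -> Translator:
    """Compose two translators so that the output of `first` feeds `second`."""
    def translate(sentence: str) -> str:
        return second(first(sentence))
    return translate


# Toy lexicons, invented for illustration only.
ca_es = make_toy_translator({"parlament": "parlamento", "europeu": "europeo"})
es_en = make_toy_translator({"el": "the", "parlamento": "parliament", "europeo": "european"})

# Catalan -> English obtained by bridging through Spanish.
ca_en = bridge(ca_es, es_en)
print(ca_en("el parlament europeu"))
```

In a cascade of this kind, errors made in the first step propagate to the second, which is one way to think about the "cost" of bridging that the abstract refers to.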

Similar resources

Catalan-English statistical machine translation without a parallel corpus

This paper presents a full experiment on large-vocabulary Catalan-English statistical machine translation without an English-Catalan parallel corpus, in the context of the debates of the European Parliament. To this end, we use an English-Spanish European Parliament Proceedings parallel corpus and a Spanish-Catalan general newspaper parallel corpus, each of more than 30 M words. G...

A Large Spanish-Catalan Parallel Corpus Release for Machine Translation

We present a large Spanish-Catalan parallel corpus extracted from ten years of the paper edition of a bilingual Catalan newspaper. The produced corpus of 7.5 M parallel sentences (around 180 M words per language) is useful for many natural language applications. We report excellent results when building a statistical machine translation system trained on this parallel corpus. The Spanish-Catala...

Development of Language Resources for Speech-to-speech Translation

This paper describes the creation of linguistically enriched aligned corpora for Catalan, Spanish and US-English for Speech-to-Speech Translation. These corpora are obtained from two different sources: US-English transcribed speech data and transcriptions of conversations recorded in Catalan and Spanish. After human translation, a large trilingual spontaneous speech corpus has been obtained. Thi...

Towards the Use of Word Stems and Suffixes for Statistical Machine Translation

In this paper we present methods for improving the quality of translation from an inflected language into English by making use of part-of-speech tags and word stems and suffixes in the source language. Results for translations from Spanish and Catalan into English are presented on the LC-STAR trilingual corpus which consists of spontaneously spoken dialogues in the domain of travelling and app...

Persian-Spanish Low-Resource Statistical Machine Translation Through English as Pivot Language

This paper is an attempt to exclusively focus on investigating the pivot language technique in which a bridging language is utilized to increase the quality of the Persian–Spanish low-resource Statistical Machine Translation (SMT). In this case, English is used as the bridging language, and the Persian–English SMT is combined with the English–Spanish one, where the relatively large corpora of e...

Publication date: 2006